We share this use case to show all features of rANOMALY packages that allow to highlight microbial community differences between groups of samples. rANOMALY main features are :
ASV generation thanks to dada2 algorithm
Taxonomic assignment thanks to IDTAXA from DECIPHER packages. rANOMALY allow to assign with up to 2 complementary reference database (eg. SILVA which is generalist, DAIRYdb which is specific to dairy environments), keep the best and more confident taxonomic assignment, check for taxonomy incongruency like empty fields and multiple ancestors.
phyloseq format to handle datas and downstream statistical analysis.
Decontamination step thanks to decontam package. rANOMALY allows to filter ASV (Amplicon Sequence Variant) dependings on ASV overall frequence (low abondance filter), and ASV prevalence (presence in 1 or more samples), specific filters like manual suppression of specific taxa are available to.
Complete statistical analysis workflow: rarefaction curves, community composition plots, alpha/beta diversity analysis with associated statistical tests and multivariate analysis, and differential analysis using 3 three different methods (metacoder, DESeq2, metagenomeSeq).
At each steps, plots and R objects are saved in folders to be able to relaunch the analysis and to export plots.
Each function has a detailed help accessible in R via ?{function}.
The dataset can be downloaded via this link.
This tutorial assume that you have extracted all the read file in a folder named reads along with the sample-metadata.csv file.
We share a 24 samples test dataset extracted from rats feces at two different time (t0 & t50) and in two nutrition conditions. Rats were feeded with two strains of a particular bacteria (wild type & mutant). Also, extraction control sample (blank) are included too.
sm <- read.table("sample_metadata.csv", sep="\t",header=TRUE)
DT::datatable(sm)
load("decontam_out/robjects.Rdata")
The first step is the creation of ASVs thanks to the dada2 package. In rANOMALY, only one function is needed to compute all the different steps require from this package. The function is designed to process illumina paired end sequences in fastq format.
Sample names are extracted from the file name, from the beginning to the first underscore (_), so files must be formatted as followed : {sample-id1}_R1.fastq.gz {sample-id1}_R2.fastq.gz etc… Those must match the sample names stored in the sample metadata file.
dada_res = dada2_fun(path="./reads", dadapool = "pseudo", compress=TRUE, plot=FALSE)
Main outputs:
./dada2_out/read_tracking.csv summarizes the read numbers after each filtering step.
./dada2_out/raw_otu-table.csv the raw ASV table
./dada2_out/rep-seqs.fna: fasta file with all representative sequences for each ASV
./dada2_out/robjects.Rdata with saved dada_res list containing raw ASV table and representative sequences in objects otu.table, seqtab.export & seqtab.nochim.
Read tracking table:
input: raw read number.
filtered: after dada2 filtering step: no N’s in sequence, low quality, and phiX.
denoisedF & denoisedR: after denoising. Forward & Reverse.
merged: after merging R1 & R2.
nonchim: after chimeras filtering.
DT::datatable(read.table("dada2_out/read_tracking.csv",sep="\t",header=TRUE), filter = "top")
assign_taxo_fun function uses IDTAXA function from DECIPHER package, and allows to use 2 different databases. It keeps the best assignment on 2 criteria, resolution (depth in taxonomy assignment) and confidence. The final taxonomy is validated by multiple ancestors taxa and incongruity correction step.
We share the latest databases we use in the IDTAXA format in this link. You can also generate your own IDTAXA formatted database following those instructions and scripts we provide at this page.
tax.table = assign_taxo_fun(dada_res = dada_res, id_db = c("path_to_your_banks/silva/SILVA_SSU_r132_March2018.RData","path_to_your_banks/DAIRYdb_v1.2.0_20190222_IDTAXA.RData") )
Main file outputs:
./idtaxa/robjects.Rdata with taxonomy in phyloseq format in tax.table object.
final_tax_table.csv the final assignation table that will be use in next steps.
allDB_tax_table.csv raw assignations from the two databases, mainly for debugging.
The phylogenetic tree from the representative sequences is generated using phangorn and DECIPHER packages.
tree = generate_tree_fun(dada_res)
Main outputs: - tree_robjects.Rdata with phylogenetic tree object in phyloseq format.
To create a phyloseq object, we need to merge four objects and one file:
the asv table dada_res and the representative sequences seqtab.nochim from dada2_robjects.Rdata
a taxonomy table (taxtable) taxo_robjects.Rdata from taxo_robjects.Rdata
the phylogenetic tree tree from tree_robjects.Rdata
metadata from sample-metadata.csv
data = generate_phyloseq_fun(dada_res = dada_res, tax.table = tax.table, tree = tree, metadata = "./sample_metadata.csv")
data
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 146 taxa and 24 samples ]
## sample_data() Sample Data: [ 24 samples by 5 sample variables ]
## tax_table() Taxonomy Table: [ 146 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 146 tips and 144 internal nodes ]
## refseq() DNAStringSet: [ 146 reference sequences ]
Main output:
./phyloseq/robjects.Rdata with phyloseq object in data for raw counts and data_rel for relative abundance.The decontam_fun function uses decontam R package with control samples to filter contaminants. The decontam package offers two main methods, frequency and prevalence (and then you can combine those methods). For frequency method, it is mandatory to have the DNA concentration of each sample in phyloseq (and hence in the sample-metadata.csv). The prevalence method does not need DNA quantification, this method allows to compare presence/absence of ASV between real samples and control samples and then identify contaminants.
Tips: sequencing plateforms often quantify the DNA before sequencing, but do not automaticaly give the information. Just ask for it ;).
Our function integrates the basics ASV frequency (nb_reads_ASV/nb_total_reads) and overall prevalence (nb_sample_ASV/nb_total_sample) filtering. We included an option to filter out ASV based on their taxa names (for known recurrent contaminants).
data = decontam_fun(data = data, domain = "Bacteria", column = "type", ctrl_identifier = "control", spl_identifier = "sample", number = 100, krona= TRUE)
Main outputs: - robjects.Rdata with contaminant filtered phyloseq object named data. - Exclu_out.csv list of filtered ASVs for each filtering step. - Kronas plot before and after filtering. - raw_asv-table.csv & relative_asv-table.csv. - venndiag_filtering.png.
We are currently developping a ShinyApp to visualize your data, sub-select your samples/taxons and do all those analyses interactively !!! ExploreMetabar
In order to observe the sampling depth of each samples we start by plotting rarefactions curves. Those plots are generated by plotly which makes the plots interactive.
rareplot = rarefaction(data, "strain_time", 100 )
htmltools::tagList(list(rareplot))
The bar_fun function outputs a composition plot, in this example it reveals the top 10 genus present in our samples. The function allows to plot at different rank and to modify the number of taxas to show.
Ord1 option order the sample along the X axis.
Fact1 option control labels of the X axis. Fact1="sample.id" if you don’t want the sample to be renamed.
bars_fun(data = data, top = 10, Ord1 = "strain_time", Fact1 = "strain_time", rank="Genus", relative = FALSE)
bars_fun(data = data, top = 10, Ord1 = "strain_time", Fact1 = "strain_time", rank="Genus", relative = TRUE)
This function computes various alpha diversity indexes, it uses the estimate_richness function from phyloseq.
Available measures : c(“Observed”, “Chao1”, “ACE”, “Shannon”, “Simpson”, “InvSimpson”, “Fisher”).
It returns a list which contains:
a boxplot comparing conditions. ($plot)
a table of indices values. ($alphatable)
And for each of the computed indices :
an ANOVA analysis. (${measure}$anova)
a pairwise wilcox test result comparing conditions and giving the pvalue of each comparison tested. (${measure}$wilcox_col1, ${measure}$wilcox_col2_fdr, ${measure}$wilcox_col2_collapsed)
a mixture model if your data include repetition in sampling, ie. column3 option. (${measure}$anovarepeat, ${measure}$mixedeffect)
All this in a single function.
alpha <- diversity_alpha_fun(data = data, output = "./plot_div_alpha/", column1 = "strain", column2 = "time", column3 = "", supcovs = "", measures = c("Observed", "Shannon") )
The table of values for each indices you choose to compute.
pander(alpha$alphatable, style='rmarkdown')
| Observed | Shannon | |
|---|---|---|
| SB1.Sauv0 | 41 | 1.477 |
| SB10.Mut0 | 40 | 2.073 |
| SB11.Mut0 | 51 | 2.178 |
| SB12.Mut0 | 38 | 2.116 |
| SB13.Sauv50 | 46 | 2.691 |
| SB14.Sauv50 | 57 | 2.905 |
| SB15.Sauv50 | 50 | 2.793 |
| SB16.Sauv50 | 52 | 2.8 |
| SB17.Sauv50 | 49 | 2.624 |
| SB18.Sauv50 | 54 | 2.831 |
| SB19.Mut50 | 66 | 2.638 |
| SB2.Sauv0 | 26 | 2.099 |
| SB20.Mut50 | 72 | 2.721 |
| SB21.Mut50 | 79 | 3.062 |
| SB22.Mut50 | 81 | 2.81 |
| SB23.Mut50 | 84 | 3.175 |
| SB24.Mut50 | 90 | 3.148 |
| SB3.Sauv0 | 19 | 0.1962 |
| SB4.Sauv0 | 41 | 2.52 |
| SB5.Sauv0 | 46 | 1.923 |
| SB6.Sauv0 | 46 | 1.067 |
| SB7.Mut0 | 33 | 2.256 |
| SB8.Mut0 | 58 | 2.089 |
| SB9.Mut0 | 50 | 2.237 |
The boxplots of those values.
ggplotly(alpha$plot) %>%
layout(boxmode = "group")
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## 'font', 'title', 'uniformtext', 'autosize', 'width', 'height', 'margin', 'paper_bgcolor', 'plot_bgcolor', 'separators', 'hidesources', 'showlegend', 'colorway', 'datarevision', 'uirevision', 'editrevision', 'selectionrevision', 'template', 'modebar', 'meta', 'transition', '_deprecated', 'clickmode', 'dragmode', 'hovermode', 'hoverdistance', 'spikedistance', 'hoverlabel', 'selectdirection', 'grid', 'calendar', 'xaxis', 'yaxis', 'ternary', 'scene', 'geo', 'mapbox', 'polar', 'radialaxis', 'angularaxis', 'direction', 'orientation', 'editType', 'legend', 'annotations', 'shapes', 'images', 'updatemenus', 'sliders', 'colorscale', 'coloraxis', 'metasrc', 'barmode', 'bargap', 'mapType'
For each indices, you have access to the ANOVA test. Here we present the result for the “Observed” indice.
pander(alpha$Observed$anova)
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| Depth | 1 | 49.36 | 49.36 | 0.5091 | 0.4838 |
| strain | 1 | 1877 | 1877 | 19.36 | 0.0002764 |
| time | 1 | 3649 | 3649 | 37.64 | 5.392e-06 |
| Residuals | 20 | 1939 | 96.96 | NA | NA |
Wilcox tests are made on each factor you have entered, and the combination of the two. Here “strain” and “time”.
pander(alpha$Observed$wilcox_col1)
| mutant | |
|---|---|
| wildtype | 0.043 |
pander(alpha$Observed$wilcox_col2_fdr)
| t0 | |
|---|---|
| t50 | 0.001 |
pander(alpha$Observed$wilcox_col2_collapsed)
| mutant_t0 | mutant_t50 | wildtype_t0 | |
|---|---|---|---|
| mutant_t50 | 0.002 | NA | NA |
| wildtype_t0 | 0.377 | 0.005 | NA |
| wildtype_t50 | 0.336 | 0.002 | 0.008 |
The diversity_beta_fun function is an exhaustive function allowing to generate all possible tests and ordination in one command.
The diversity_beta_light function is equivalent to the first fonction but allows to generate specific tests and figures ready to publish in rmarkdown as in the example below. It is based on the vegan package function vegdist for the distance calculation and ordinate for the ordination plot.
We include statiscal test to ease the interpretation of your results. Here we include permutational ANOVA to compare groups and test that the centroids and dispersion of the groups are equivalent for all groups. User can inform col and cov arguments to assess PERMANOVA to determine significant differences between groups (eg. factor “strain” here and covariable “time”). A pairwise-PERMANOVA is processed to determine which conditions are significantly different from another (based on p-value).
As return, you will get an object that contains:
An ordination plot ($plot)
The permANOVA results ($permanova)
The pairwise permANOVA ($pairwisepermanova)
# beta <- diversity_beta_light(data = data, output = "./plot_div_beta/", glom = "ASV", column1 = "time", column2 = "strain", covar ="")
beta_strain = diversity_beta_light(data, col = "strain", cov="time", dist0 = "bray", ord0 = "MDS", output="./plot_div_beta_strain/", tests = TRUE)
beta_time = diversity_beta_light(data, col = "time", cov="strain", dist0 = "bray", ord0 = "MDS", output="./plot_div_beta_time/", tests = TRUE)
beta_strain_time = diversity_beta_light(data, col = "strain_time", dist0 = "bray", ord0 = "MDS", output="./plot_div_beta_strain_time/", tests = TRUE)
htmltools::tagList(list(ggplotly(beta_strain$plot), ggplotly(beta_time$plot), ggplotly(beta_strain_time$plot)))
pander(beta_strain$permanova)
| Df | SumsOfSqs | MeanSqs | F.Model | R2 | Pr(>F) | |
|---|---|---|---|---|---|---|
| Depth | 1 | 0.5075 | 0.5075 | 3.122 | 0.06845 | 0.01199 |
| time | 1 | 2.185 | 2.185 | 13.44 | 0.2946 | 0.000999 |
| strain | 1 | 1.471 | 1.471 | 9.049 | 0.1984 | 0.000999 |
| Residuals | 20 | 3.251 | 0.1626 | NA | 0.4385 | NA |
| Total | 23 | 7.415 | NA | NA | 1 | NA |
pander(beta_strain_time$pairwisepermanova)
| pairs | Df | SumsOfSqs | F.Model | R2 | p.value |
|---|---|---|---|---|---|
| wildtype_t0 vs mutant_t0 | 1 | 0.9528 | 4.64 | 0.317 | 0.008 |
| wildtype_t0 vs wildtype_t50 | 1 | 2.021 | 28.97 | 0.7434 | 0.003 |
| wildtype_t0 vs mutant_t50 | 1 | 2.197 | 26 | 0.7223 | 0.003 |
| mutant_t0 vs wildtype_t50 | 1 | 1.681 | 11.59 | 0.5369 | 0.003 |
| mutant_t0 vs mutant_t50 | 1 | 1.57 | 9.826 | 0.4956 | 0.003 |
| wildtype_t50 vs mutant_t50 | 1 | 1.818 | 75.22 | 0.8827 | 0.004 |
| p.adjusted | sig |
|---|---|
| 0.008 | * |
| 0.0045 | * |
| 0.0045 | * |
| 0.0045 | * |
| 0.0045 | * |
| 0.0048 | * |
We choose three different methods to process differential analysis which is a key step of the workflow. The main advantage of the use of multiple methods is the cross validation of differentially abundant taxa between tested conditions.
Metacoder is the most simple differential analysis tool of the three. Counts are normalized by total sum scaling to minimise the sample sequencing depth effect and metacoder use a Kruskal-Wallis test to determine significant difference between sample groups. The metacoder_fun function allows the user to choose the taxonomic rank, which factor to the test (column1), and a specific pairwise comparison (comp) to launch the differential analysis.
It also produces pretty graphical trees, representing the taxa present in both groups and coloring the branches depending on the group in which this taxa is more abundant. Two of those trees are produced, the raw one and the non significant features filtered one (p-value <= 0.05).
out1 = metacoder_fun(data = data, output = "./metacoder", column1 = "strain_time", rank = "Family", signif = TRUE, plottrees = TRUE, min ="10", comp = "wildtype_t50~mutant_t50,wildtype_t0~mutant_t0")
DT::datatable(out1$table, filter = "top", options = list(scrollX = TRUE))
out1$wildtype_t0_vs_mutant_t0$signif
## [[1]]
out1$wildtype_t50_vs_mutant_t50$signif
## [[1]]
DESeq2 is a widely used method, primarily for RNAseq applications, for assessing differentially expressed genes between controlled conditions. Its use for metabarcoding datas is sensibly the same. The deseq2_fun allows to process differential analysis as metacoder_fun, and users can choose the taxonomic rank, factor to test and which conditions to compare. DESeq2 algorithm uses negative binomial generalized linear models with VST normalization (Variance Stabilizing Transformation).
Main output is a list with :
#fig.keep='all', fig.align='left', fig.width = 15, fig.height = 10
out2 = deseq2_fun(data = data, output = "./deseq/", column1 = "strain_time", verbose = 1, rank = "Family", comp = "wildtype_t50~mutant_t50,wildtype_t0~mutant_t0")
ggplotly(out2$wildtype_t50_vs_mutant_t50$plot)
DT::datatable(out2$wildtype_t50_vs_mutant_t50$table, filter = "top", options = list(scrollX = TRUE))
MetagenomeSeq use a normalization method able to control for biases in measurements across taxonomics features and a mixture model taht implements a zero-inflated Gaussian distribution to account for varying depths of coverage. As deseq2_fun, metagenomeseq_fun returns table with statistics and plots with significant features for each comparison.
out3 = metagenomeseq_fun(data = data, output = "./metagenomeseq/", column1 = "strain_time", verbose = 1, rank = "Family", comp = "wildtype_t50~mutant_t50,wildtype_t0~mutant_t0")
ggplotly(out3$wildtype_t50_vs_mutant_t50$plot)
DT::datatable(out3$wildtype_t50_vs_mutant_t50$table, filter = "top", options = list(scrollX = TRUE))
The aggregate_fun function allows to merge the results from the three differential analysis methods used before, to obtain one unique table with all information of significant differentially abundant features.
The generated table include the following informations :
seqid: ASV ID
Comparaison: Tested comparison
Deseq: differentially abundant with this method (0 no or 1 yes)
metagenomeSeq: differentially abundant with this method (0 no or 1 yes)
metacoder:differentialy abundant with this method (0 no or 1 yes)
sumMethods: sum of methods in which feature is significant.
DESeqLFC: Log Fold Change value from DESeq2
absDESeqLFC: absolute value of Log Fold Change value from DESeq2
MeanRelAbcond1: Mean relative abundance in condition 1
MeanRelAbcond2: Mean relative abundance in condition 2
Condition: in which the mean feature abundance is higher.
Taxonomy & representative sequence.
resF = aggregate_fun(data = data, metacoder = "./metacoder/metacoder_signif_Family.csv", deseq = "./deseq/", mgseq = "./metagenomeseq/", output = "./aggregate_diff/", column1 = "strain_time", column2 = NULL, verbose = 1, rank = "Genus", comp = "wildtype_t50~mutant_t50,wildtype_t0~mutant_t0")
ggplotly(resF$wildtype_t0_vs_mutant_t0$plot)
ggplotly(resF$wildtype_t50_vs_mutant_t50$plot)
DT::datatable(resF$table, filter = "top", options = list(scrollX = TRUE))
Our PLS-DA/sparse PLS-DA analysis is based on the mixOmics package. These are supervised classification methods. They allow to define most important features discriminating our groups. The plsda_res function outputs a list of graphs and tables:
plsda$plotIndiv ordination plot from the PLSDA analysis.
splsda$plotIndiv ordination plot from the sparse PLSDA analysis.
loadings$comp{n} all the loadings plots for each composant from the sPLSDA analysis. Loadings weights are displayed and colors correspond to the group in which each feature has its maximum mean value.
splda.loading_table Loading of all taxas for each composant.
splsda$plotArrow arrow plot of samples.
plsda_res <- plsda_fun(data = data, output = "./plsda_family/", column1 = "strain_time", rank = "Family")
plsda_res$splsda.plotIndiv$graph
knitr::include_graphics('plsda_family/splsda_loadings_strain_time_Family_comp1.png')
knitr::include_graphics('plsda_family/splsda_loadings_strain_time_Family_comp2.png')
knitr::include_graphics('plsda_family/splsda_loadings_strain_time_Family_comp3.png')
knitr::include_graphics('plsda_family/splsda_loadings_strain_time_Family_comp4.png')
heatmap_fun allows to plot relative abundance of more abundant taxas (20 by default). Use can choose which taxonomy rank to plot and the factor used to seprate plot.
heatmap_plot = heatmap_fun(data = data, column1 = "strain_time", top = 20, output = "./plot_heatmap/", rank = "Species")
ggplotly(heatmap_plot)
csv2phyloseq_fun allows to import the 4 files in tabulated format to generate phyloseq object ready to use with rANOMALY.
assign_fasta_fun is used to assign sequences from fasta files.
export_to_stamp_fun allows you to generate input files for STAMP.
split_table_fun allows to split the phyloseq object in multiple sub-object according to one column.
update_metadata_fun allows you to easily modify sample_data part of the phyloseq object.